library( tidyverse )
library( knitr )
library( kableExtra )

The Binomial

For cases where observations can be put into one of two groups.

\[ P(x=K|N,p) = \frac{N!}{ K! (N-K)!}p^K(1-p)^{N-K} \]

Used as the generative function for Bayesian inference.
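As a quick check, the mass function above is available in R as `dbinom()`; for example, with \(N = 50\), \(K = 37\), and \(p = 0.74\) (the values used in the example below):

```r
# P(K = 37 | N = 50, p = 0.74) via the built-in mass function
dbinom(37, size = 50, prob = 0.74)

# The same value computed directly from the formula above
choose(50, 37) * 0.74^37 * (1 - 0.74)^(50 - 37)
```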

The Binomial Test

Can be used to test if a set of data is equivalent to an expectation given by the underlying probability of seeing an item in “Group 1” (catfish in this example):

 

\(H_O: p = \hat{p}\)

 

For this, the test statistic is defined as the number of observations in “Group 1” (e.g., \(N_{catfish}\)), and what we know about the probability of selecting items that have two distinct states is used to evaluate the likelihood of getting \(N_{catfish}\).

The Data

Borrowing from a previous example, let’s assume we are sampling fish in the James River. The hypothesis is that the frequency of catfish is

\[ p_{catfish} = \frac{N_{catfish}}{N_{catfish} + N_{other}} = \frac{37}{50} \approx 0.74 \]

So if we have another set of samples, we can test the null hypothesis:

\(H_O: p = 0.74\)

Using the new counts of \(N_{catfish}\) and \(N_{other}\).

Built-In Testing Function

R has a built-in function for this: binom.test().

 

It allows testing either two-sided or one-sided alternative hypotheses.
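A minimal sketch of the one-sided option; the counts here (37 catfish, 13 other, matching the earlier sample) and the null value of \(p = 0.5\) are just for illustration:

```r
# One-sided test: is the catfish proportion greater than 0.5?
# (default is alternative = "two.sided"; "less" is also available)
binom.test(c(37, 13), p = 0.5, alternative = "greater")
```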

Example With New Data

So, let’s assume you went out and sampled new data and found

catfish <- 52
non_catfish <- 47 

 

To test if this sampling exercise is consistent with the previous assumption of a 74% catfish presence, we can specify the data as:

samples <- c( catfish, non_catfish )
p_catfish <- 0.74

The Binomial Test

fit <- binom.test( samples, p = p_catfish)
fit

    Exact binomial test

data:  samples
number of successes = 52, number of trials = 99, p-value = 5.007e-06
alternative hypothesis: true probability of success is not equal to 0.74
95 percent confidence interval:
 0.4223973 0.6265570
sample estimates:
probability of success 
             0.5252525 

Conclusion?

Contained Data

As usual, the object returned by the function contains all the necessary items for you to include the results in your manuscript.

names(fit)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
[6] "null.value"  "alternative" "method"      "data.name"  
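These components can be extracted individually for inline reporting; for example, re-running the test from above:

```r
# Re-run the binomial test and pull out pieces for reporting
fit <- binom.test(c(52, 47), p = 0.74)
fit$p.value     # 5.007e-06, as printed above
fit$estimate    # observed probability of success (52/99)
```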

Multinomial Extension

When we have more than two unique states, we can extend the Binomial test to a more generalized Multinomial expansion. This would be used to test the hypothesis about specific probabilities for \(K-1\) states (the number of independently assorting groups). However, this is a specific version of a more generalized approach using Contingency Tables and I won’t be going over it here.

However, there are a few libraries in R available if you’d prefer to use the multinomial expansion.
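For the simplest case, a multinomial goodness-of-fit test against a set of specified probabilities can be run in base R via chisq.test() with the p argument, without any extra libraries; the three-group counts and probabilities below are hypothetical:

```r
# Hypothetical counts for three fish groups and hypothesized probabilities
counts     <- c(52, 30, 17)
expected_p <- c(0.60, 0.25, 0.15)   # must sum to 1
chisq.test(counts, p = expected_p)
```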

Contingency Tables

Contingency tables are another counting approach, where your data can be classified into one or more categories. Often it is taken as a comparison between populations. So, in keeping with our fish theme, let’s assume we went to another river and did another sampling run on catfish. In general, we are not limited to “Group A” vs “Group B” but we will use it for continuity here.

Nomenclature

This is defined as an \(R \times C\) Contingency Table, where \(R\) is the number of rows and \(C\) is the number of columns.

In this example, I am going to assume that the rows are independent sampling events and the columns represent the grouping of individual sample observations.

Underlying Hypothesis

For this, we are assuming that the observations in the row are taken from a larger background population.

  • \(p_{catfish,\; pop1} = \frac{O_{11}}{n_1}\)
  • \(p_{other,\; pop1} = \frac{O_{12}}{n_1}\)
  • \(p_{catfish} + p_{other} = 1.0\)

The same applies for samples from Population 2.

\(H_O: p_{catfish, \; pop1} = p_{catfish, \; pop2}\)
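Using the observed counts from Population 1 (52 catfish, 47 other, from the sample above), these row proportions are easy to compute directly:

```r
O11 <- 52; O12 <- 47                 # observed counts, Population 1
n1  <- O11 + O12
p_catfish_pop1 <- O11 / n1
p_other_pop1   <- O12 / n1
p_catfish_pop1 + p_other_pop1        # sums to 1, as required
```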

Larger Configurations

If we had more than just two Groups, we expand these to the vector of probabilities by row.

\(H_O: \vec{p}_i = \vec{p}_j\)

 

And the same applies for more than just two rows.

Test Statistic

The test statistic for this is based upon:

The distance between the observed and expectations for the data.

Observed

The count of items in each combination of categories (e.g., \(O_{11}\), \(O_{21}\), etc.).

Expected

Based on number of categories OR external information.

Observed Values - Example Data

Let’s assume I went out and sampled 50 more fishes and my overall data are now:

Observed <- matrix( c(52, 47, 32, 18), nrow=2, byrow = TRUE)
rownames( Observed ) <- c("Population 1", "Population 2")
colnames( Observed ) <- c("Catfish", "Other")
Observed
             Catfish Other
Population 1      52    47
Population 2      32    18

Defining Expectations - No Information

If we have no prior information, and the underlying hypothesis is that samples may be put into any of the groups at random (e.g., the underlying assumption that all fish species have equal abundance), then the expectation is that the frequencies of observations would be \(\frac{1}{K}\), where \(K\) is the number of groups.

Expectations - No Information

n1 <- 52 + 47
n2 <- 32 + 18 
Expected <- matrix( c( 0.5 * n1, 
                       0.5 * n1, 
                       0.5 * n2,
                       0.5 * n2), nrow=2, byrow = TRUE)
rownames( Expected ) <- c("Population 1", "Population 2")
colnames( Expected ) <- c("Catfish", "Other")
Expected
             Catfish Other
Population 1    49.5  49.5
Population 2    25.0  25.0

Expectations - Prior Information

In this example, we have a prior expectation that:

\[ p_{catfish} = 0.74 \]

\[ p_{other} = 1.0 - 0.74 = 0.26 \]

Expectations - Prior Information

Expected <- matrix( c( 0.74 * n1, 
                       (1-0.74) * n1, 
                       0.74 * n2,
                       (1-0.74) * n2), nrow=2, byrow = TRUE)
rownames( Expected ) <- c("Population 1", "Population 2")
colnames( Expected ) <- c("Catfish", "Other")
Expected
             Catfish Other
Population 1   73.26 25.74
Population 2   37.00 13.00

Observations and Expectations

The test statistic should measure how close the data in these two matrices are.

 

Observed
             Catfish Other
Population 1      52    47
Population 2      32    18
Expected
             Catfish Other
Population 1   73.26 25.74
Population 2   37.00 13.00

 

  • Small deviations \(\to\) no differences and failure to reject \(H_O\).
  • Big deviations \(\to\) rejection of \(H_O\).

Test Statistic

This is based upon an approximation to the \(\chi^2\) distribution, which is well behaved and described for counting data.

 

\[ T = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Estimating Directly

I’m going to show you how to estimate this from the raw count data so we can get an example where you have to look up the probability for an underlying distribution. There is a chisq.test() function, but you have to be very careful that you have it set up properly for contingency table analyses.

T <- sum( (Observed - Expected)^2 / Expected  )
T
[1] 26.32813
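For comparison, the built-in chisq.test() on the observed matrix computes its expectations from the row and column margins rather than from a prior \(p\), so it answers a slightly different question than the calculation above; correct = FALSE turns off the continuity correction that is otherwise applied to 2x2 tables:

```r
# Standard contingency-table test: expected counts come from the margins
Observed <- matrix(c(52, 47, 32, 18), nrow = 2, byrow = TRUE)
chisq.test(Observed, correct = FALSE)
```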

Rejecting the Null Hypothesis

The general contingency table approach is based upon \(df = (r-1)(c-1)\) independent elements and is distributed as a \(\chi^2\) random variable. In a classic ‘hypothesis testing’ framework with \(\alpha = 0.05\), we would estimate the critical value of the test statistic as:

alpha <- 0.05
p <- 1.0 - alpha 
critical_value <- qchisq(p, df=1 )
critical_value
[1] 3.841459
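The decision rule is then a direct comparison of the test statistic to this critical value:

```r
T <- 26.32813                        # test statistic computed above
critical_value <- qchisq(0.95, df = 1)
T > critical_value                   # TRUE, so we reject H_O
```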

The Probability of the Test Statistic

To get the actual probability associated with \(T = 26.33\), we pass the test statistic to the cumulative distribution function for the \(\chi^2\) distribution.

p.value <- 1.0 - pchisq( T, df=1 )
p.value
[1] 2.88063e-07

Tabulating Data

  Sample Species         Site Measurement
1      1 Catfish Population 1    18.01923
2      2 Catfish Population 1    12.63293
3      3 Catfish Population 1    12.29715
4      4 Catfish Population 1    12.82386
5      5 Catfish Population 1    11.64118
6      6 Catfish Population 1    12.05839

Tabulating Data

The table() function can take the columns of factors and provide for you the observed matrix of counts. If you name your factors descriptively, it will take care of the row and column headers for you.

table( df$Site, df$Species ) -> Obs
Obs
              
               Catfish Other
  Population 1      52    47
  Population 2      32    18

What do we do when our data are not normal?

Non-Parametric Analyses

There are equivalent tests (OK, not strictly equivalent, but they can be used to answer a similar question) that do not have assumptions of normality, homoscedasticity, etc.

Important

These are largely based upon Ranks and Combinatory Theory.

You gain the applicability with a potential loss of information (and hence statistical power).

All of these methods are based upon ranks.

Non-Parametric Correlation

The standard cor.test() function has an option to swap out the Pearson product-moment approach for one based on ranks.

Example Comparison

I’m going to use the Iris data set to look at a correlation for relatively poorly behaved data.

Example Comparison

fit_pearson <- cor.test( iris$Sepal.Length, iris$Sepal.Width )
fit_kendall <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "kendall" )
fit_spearman <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "spearman" )
data.frame( Model=c("Pearson","Kendall","Spearman"), 
            Parameter = c("r","tau","rho"),
            R = c(fit_pearson$estimate, fit_kendall$estimate, fit_spearman$estimate),
            P = c(fit_pearson$p.value, fit_kendall$p.value, fit_spearman$p.value ) ) -> df 

Example Comparison

Here is the difference in the models (ties impacted the Spearman p-value estimate).

df |>
  kable( digits = 3, row.names = FALSE) |> 
  kable_classic(  full_width=FALSE )
   Model    Parameter        R       P
   Pearson  r           -0.118   0.152
   Kendall  tau         -0.077   0.183
   Spearman rho         -0.167   0.041

Non-Parametric Regression

There are a few different methods for estimating linear regression models via non-parametric approaches. However, they’d require a bit more background on counting statistics and I’m going to pass on this here.

If you need to do non-parametric regression, come see me and we’ll spend a bit of time getting the foundation set for it.

Non-Parametric T-Test: Mann-Whitney

This is a rank-based alternative to the t-test for the equivalence of the central location of two data sets (e.g., \(H_O: \bar{x} = \bar{y}\)).

Mann-Whitney Assumptions

This approach has the following assumptions:

  1. Both sets of data are random samples from their respective population.
  2. The two samples are independent of each other.
  3. The measurement scale for each variable is at least ordinal (e.g., you can put them in order).

Example Data - Mann-Whitney

mtcars |>
  select( Transmission = am, MPG = mpg ) |>
  mutate( Transmission = as.factor( Transmission ) ) |>
  mutate( Transmission = fct_recode( Transmission, "Automatic"="0", "Manual"="1")) -> data

data |>
  ggplot( aes(Transmission, MPG) )  + 
  geom_boxplot( notch=TRUE )
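The test itself is wilcox.test() in R; here is a self-contained sketch that rebuilds the same two groups in base R and runs the Mann-Whitney (Wilcoxon rank-sum) test:

```r
# Rebuild the Transmission/MPG data frame from mtcars in base R
data <- data.frame(
  Transmission = factor(mtcars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual")),
  MPG = mtcars$mpg
)

# Mann-Whitney (Wilcoxon rank-sum) test of MPG by transmission type
wilcox.test(MPG ~ Transmission, data = data)
```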

Non-Parametric ANOVA
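The rank-based analogue of a one-way ANOVA is the Kruskal-Wallis test, available as kruskal.test() in R; a minimal sketch using the iris data from the correlation example above:

```r
# Kruskal-Wallis rank-sum test: does Sepal.Width differ among species?
kruskal.test(Sepal.Width ~ Species, data = iris)
```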

Questions?

If you have any questions, please feel free to post to the Canvas discussion board for the class, or drop me an email.
